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. Abstract: In this article we propose a maximal a posteriori (MAP) criterion 

for model selection in the motif discovery problem and investigate conditions 
under which the MAP asymptotically gives a correct prediction of model size. 
' We also investigate robustness of the MAP to prior specification and provide 

guidelines for choosing prior hyper-parameters for motif models based on sen- 
sitivity considerations. 



1. Introduction: statistical challenges in motif discovery 



Genome sequeneing projects have led to a rapid growth of databases of genome 
sequences for DNA, RNA and proteins. The task of extracting insight into gene reg- 
ulatory networks from these massive amounts of data represents a major scientific 
challenge. A first step towards understanding the process of gene regulation is the 
J> ' identification of short recurring patterns (about 8-20 nucleotides long), called mo- 

, tifs, in a set of bio-polymer sequences. Motifs correspond to functionally important 

parts of molecules, such as the sites where certain proteins, called transcription fac- 
tors (TFs) bind, to control gene expression. In spite of a plethora of computational 
methods developed to address the problem of motif discovery (Sandvc and Drablos 
[!•")] and Gupta and Liu [7]). most approaches still suffer from a lack of specificity 
' in motif predictions- yielding a high number of false positives that appear to have 

, no known biological significance. Thus, one of the fundamental questions that arise 

is whether the patterns discovered from the sequence data by an algorithm are 
real. Although confirming the biological relevance of these findings often requires 
' further biological experimentation, it is at least important to assess their statis- 

tical significance. We approach this question of model selection from a Bayesian 
viewpoint, using an analytical approximation to the Bayes factor- the maximal a 
posteriori (MAP) scoring criterion. As an evaluation of its performance, it is shown 
that the MAP asymptotically attains several desirable properties. Since the Bayes 
factor or any such Bayesian model selection criterion necessarily involves param- 
eters of the prior distribution, we also conduct sensitivity analyses to judge the 
effect of prior parameter specifications on posterior inference and prescribe robust 
hyperparameter choices for practitioners. 
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1.1. Bayesian stochastic dictionary model for sequence patterns 

For convenience, we view the sequence data as a single sequence S = {xiX2 ■ ■ - Xn] 
of length n. S is assumed to be generated by the concatenation of words from a 
dictionary T> of size D, where T> = {Mi, M2, . . . , Af£)}, sampled randomly according 
to a probability vector p ~ {p{Mi), . . . , p{Mo)). The likelihood of S is 



where 11 ~ {Pi, . . . , Pj.) is a partition of the sequence so that each part P.^ corre- 
sponds to a word in the dictionary, -/V(n) is the total number of partitions in 11, 
and A^'m (n) is the number of occurrences of word Mj in the partition. 

Assume that the first h [b < D) words in the dictionary are the single letters (for 
DNA, h — A letters, {A, C, G, T}). Let p = (pi, . . . , po) denote the word usage prob- 
abilities for the set of all words in the dictionary. If the partition 11 = (Pi, . . . ,Pk) 
of the sequence into words were known, the resulting distribution of counts of 
words TV = {Ni, . . . Nd)'^ would be multinomial characterized by the probabil- 
ity vector p. Suppose we have D — b motifs (words other than single letters) of 
widths Wb+i, ■ ■ ■ ,wd- In the stochastic dictionary framework, "words" Mj, (j = 
&+!,..., D) are stochastic matrices denoted by {0f,+i, . . . , ©d} = ©'^^ For the 
k^^^ word of width w^, its probability matrix is represented as ©^ ~ {9ik, • ■ • , dwkk), 
each of the Wk columns giving the probabilities of occurrence of the four letters at 
that position of the word. When multiple occurrences of word k are aligned, the 
letter counts in the j^^ aligned column, Cjk = {cijk, ■ ■ ■ CbjkY' , (j = 1, . . . w^), are 
characterized by the probability vectors 6jk = {Oijk, ■ • ■ , dbjk)^ {j = 1, ■ • ■ Wfe), of 
a product multinomial model. The count matrices corresponding to the motifs are 
denoted as {Ct+i, . . . ,C d} = C. 

The partition variable 11 can be equivalently expressed by the motif site indica- 
tors, denoted as A = {Aik; i = 1, . . . ,n, k — b + 1, . . . D}, where Aik = 1 (0) if i 
is the start of a site corresponding to motif type k (otherwise). The complete data 



likelihood then is: L(Ar, C, A | ©(^\ p) cx Il^i pf' Ilk=b+i Uj^i ULi • We 



assume a Dirichlet prior distribution for p, p ^ Dirichlet(/3Q), (3q = {Poi, ■ ■ ■ Pod), 
and a corresponding product Dirichlet prior (i.e., independent priors over the 
columns) PD(JB) for @k = {Oik, ■ ■ ■ , ^wkk), {k = b+l, . . . D). A Dirichlet prior is a 
natural choice for this application, modeling the joint prior densities of proportions, 
and is conjugate with the likelihood, leading to easy computation of the posterior 
densities. Let B = (/3i,/32, . . -Pwu) be a 6 x tufc matrix with /S^ = {(3ij, . . . (ibj)'^ ■ 
The conditional posterior distribution of @k \ N,C oc HJ^i 11^=1 ^ijk''^^''^ , which 
is product Dirichlet PD(B + Ck), the pseudo-count parameters B being updated 
by the column counts of the k^^^ word, Ck ~ {cik, ■ ■ - Cuj^k)- The conditional 
posterior of p | N,C is Dirichlet (TV -I- ^q) oc Iliii P;^'^'''" ■ Bayes estimates of 
& = (©, p) may be obtained through a DA procedure using the conditional distri- 
butions P{A \ 0,S) and P{& \ S,A), using techniques of dynamic programming 
(DP) (Gupta and Liu [li]). 

2. Model selection in the motif discovery problem 

A traditional model selection approach is to use likelihood-based criteria, for in- 
stance, penalized likelihoods. Two widely used criteria arc the (i) AIC (Akaike's 



jv(n) 



D 



(1.1) 
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Information Criterion) and (ii) BIC (Bayesian Information or Schwarz Criterion). 
Leroux [ I 2] proved that, under mild conditions, the estimator obtained with the 
number of components selected using AIC or BIC in mixture models is consistent 
for the true mixing distribution, and has in the limit at least as many components. 
In particular examples, however, it has been observed that the AIC and BIC do not 
give identical results: in our case, the BIC may heavily penalize longer motifs, when 
the data set size is large. An alternative for judging the degree of conservation of a 
motif is the KuUback-Leibler information criterion (KLI), that measures the "en- 
tropy" distance of the motif from the background: KLI = X^iLi Yl'j=i ^ij l^S 
As the exact likelihood can be calculated computationally, using the procedure 
mentioned in Gupta and Liu [G] , it is possible to calculate any of the above criteria. 
However, the dependence structure in the sequence models, when the position of 
motif sites is unknown, makes it difficult to accurately determine the sample size 
n for BIC. Also, the KLI is sensitive to the width of the motif- making it difffcult 
to compare the values across different motif widths. The relative performance of 
several of these model selection criteria on experimentally verified models for real 
data sets are given in Section 2.4. 



2.1. The Bayesian approach 



We can alternatively formulate the question as a Bayesian model selection problem. 
In the simplest scenario, it is of interest to assess whether the sequence data should 
be explained by model A4i, which assumes the existence of a single nontrivial motif, 
or A4o, which says that the sequences are generated entirely from a background 
model (e.g., an i.i.d. or Markov model). The Bayes factor, which is the ratio of the 
marginal likelihoods under the two models, can be computed as 



p{S I .Ml) ^ EAloPj^^'S^^ \Mi)dQ 
Pis \ Mo) ^ J^piS^O \Mo)d& 

Pis \ Mo) \ Co)' 



The individual additive terms in the numerator, after integrating out 0, consist of 
ratios of products of gamma functions. To evaluate this sum exhaustively over all 
partitions involves prohibitive amounts of computation. We thus need to resort to 
either Monte Carlo or some approximations. 

It can be observed that p(«S | Mi) is the normalizing constant of p(A | S,Mi) = 
^^p{^M^}^ ^ ci(7(A|<S, TWi), (say) where piA, S \ Mi) is of known form. Compu- 
tational approximations to estimate normalizing constants is a subject of much 
recent research e.g. Meng and Wong [1.3], Chen and Shao [.3, 4], Chib and Jeliazkov 
[5]. We tested importance and bridge sampling (Meng and Wong [13]) methods for 
estimating the ratio, which are possible to implement since we can obtain MCMC 
draws frompi(A | S). However, neither method performs well in most of the appli- 
cations, probably since the MCMC chain is extremely sticky and hence the draws 
do not well represent the true distribution. The high concentration of the density 
of A around the mode motivates us to seek a measure of significance that uses the 
modal information, and such a method is next discussed. 
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2.2. The MAP approximation to the Bayes factor 

An obvious lower bound for (2.2) is p{A*,S \ Aii) /p{S \ M-o), where A* is the 
maximizcr of the ratio. We demonstrate how this lower bound, called the maximal a 
posteriori score (denoted by MAP (A*)), can be used as a model selection criterion. 

Definition 1. The Maximal a Posteriori (MAP) score under the stochastic dictio- 
nary model A4 1 with a single motif of width w and d letters at the "best" alignment 
A*, compared to the background model A4q (with alphabet size d) is given by 

P(S I Ma) Po(«5) 

Let c = X]j=i denote the word matrix column counts, N^i.j^^ = {Ni, . . . , Nd} 
the letter frequencies and /3o[i-d] = {f^oi ■ ■ ■ t Pod} the prior pseudo-counts for the d 
letters, /Sg denoting the pseudo-counts for the entire set of D words (D = d + 1 for 
1 long word) . Then 
(2.3) 



MAP{A*) = 



where N(-) and C(-) denote the word counts and motif column counts respectively, 
as defined in Section 1.1. For any indicator vector A (representing the sampled 
motif locations), we can integrate out & and obtain 

p{A,s\M,) f r(Ar + /3o) r(|/3o 

log I N = log 



p{s\Mo) nr(|Ar + /3o|) r(/3o) 

r{N[,.,d]+c + (3o^,.,^^) T{\(3o[,.,d^\) ' 

' f^O[l:d]\) r(/3o[l:d]) 

7) r(|7|)- 




7l) r(7) 



When there are D — d > 1 motifs (nontrivial stochastic words) in the dictionary, 
the logarithm of the MAP score (denoted logMAP) can be computed as: 

T{N + f3,) r(|/3o|) 



logMAP(A) = log 
-log 



n\N + f3,\) r(/3„) 

r(jV[i^rf] + c + /3o,i,,]) r(|/3o;i^,]|)' 

/3o[l:d]l) r(/3o[l:d]) 
D 




i V- I /r(l7l)l 



k=b+lj = l \l jr- i\/J k=b+l 



where now c = Tlik^d+i ^jk^ and Nt^i-d] and /3o[i:(i] are the same as above. 

In most simulations and actual data sets, we observed that (i) the MAP score 
dominates the other components of the Bayes factor and is often more reliable to 
differentiate the two models than the approximated Bayes factor via importance 
sampling and (ii) the observed logMAP score in i.i.d. data sets (containing no "true" 
motifs) for any selected alignment was always lower than that of the null alignment. 
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2.3. Asymptotic results for MAP scoring 

Wc now derive asymptotic optimality criteria for judging the performance of the 
MAP scoring statistic as a model selection criterion. In this section, we assume 
that Mo ^ Dir(a('')) and \ Mi ^ Dir(^('^+^^), where a = {ai,...adf 
and /3 — {Pi, . . . Pd+i)^ ■ Additionally, assume that the sequence is generated by 
random draws from an alphabet of size d according to some arbitrary probabilities. 
Let us denote the score for the "true" alignment as MAP{A^) (so that MAP{A*) > 
AIAP{A'^)). The key result for establishing the asymptotic optimality of the MAP 
criterion for model selection is as follows. 

Theorem 2.1. Form exact motif instances, each of length w, in a dataset of size 
N, let ^ ^c,{c< ^ hw, and ^ ^ Oq,, {where No^ and Nu 

are the frequency of letter i in the background and motif respectively, ^ 
1, and X^iLi ^oi = !•) In other words, the proportion of motif sequence to to- 
tal sequence data tends to a constant; the proportion of letter i in a motif tends 
to a constant ki and the sample proportions of letters approaches the population 
proportion. Also, let 

d 

r = clogc + ^(6*01 - kicw) log(6'oi - hew) 

i=l 

d 

(2.4) - log^o* - [1 - - 1)] log[l - - !)]■ 

i=l 

Then, r > is a necessary and sufficient condition for MAP{A'^) — ^> oo {as 
iV — > oo, m oo) at a rate exponential in rN . 

If the conditions of Theorem 2.1 are satisfied, MAP{A*), and thus also the Bayes 
factor, approaches infinity at an exponential rate- indicating that asymptotically 
the MAP gives the correct model judgment. Following from this result, we denote 
the expression r in Theorem 2.1 as the MAP divergence factor, or MAPdf- 

Theorem 2.2. The maximum value attained by r in expression (2.4) of Theorem 
2. 1 for fixed c, d and w is 

clogc + (1 — cw) log(l — cw) + cwlogd — [1 — c{w — 1)] log[l — c{w — 1)], 

attained for 9oi = fc^ = i for i = 1, . . . , d. 

It is interesting to note some special cases where Theorem 2.2 holds. 
Case 1: Symmetric base composition, symmetric motif contribution 

With equal background proportions of letters, i.e. 6*0; = ^ for i = 1, . . . , d, and an 
equal contribution of each base to motif h = 2 for i ~ 1, . . . , d, the conditions of 
Theorem 2.2 are satisfied. The positivity condition of expression (2.4) reduces to 

c log c + (1 — cw) log(l — cw) 

(2.5) + cw log d-[l-c{w-l)] log[l - c(u; - 1)] > 0. 

From Theorem 2.2 it is clear that the MAP score for a motif with other than 
symmetric composition will diverge more slowly. Another extreme case is a "repeat- 
based" motif, composed of a single letter repeated throughout the word. 
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Fig 1. (a) Values of MAP divergence factor (MAPuf) increase with motif widths 2 to 50 (num- 
bers on the figure) and values of c for "equal contribution" motif, i.e. an exact motif in which 
each of the d letters occurs the same number of times, (b) Values of MAPbf for motif widths 2 
to 50 and possible values of c for a "repeat-based" motif. Repeat motifs score lower than motifs 
with equal base composition. 



Case 2: Symmetric base composition, "repeat" motif 

With equal background proportions of letters, i.e. ^pi = ^ for z = 1, . . . , d, and a 
repeat motif: e.g., ki = d; fc^ = for i > 1, the condition in (2.4) reduces to 

clogc + — log + ^ dcw^ log ~ dcuu^ 
-[1 - c{w - 1)] log[l - c{w - 1)] > 0. 

(In this case, cw is necessarily < ^.) It is interesting to observe that a repeat motif 
will score lower than a motif pattern with a more varied representation of letters 
(Panel (b) of Figure 1) when all other parameters remain the same. 

Next, we check how these conditions perform in practice for given c and w. More 
precisely, we check the positiveness of the MAP divergence factor {MAPd f ) . Except 
for very low widths [w < 5), the MAPdf is almost always positive (indicated in 
Figure 1), suggesting that the performance of the MAP score as a model selection 
criterion should improve with increasing motif width, which is also observed in 
empirical studies. Figure 1 shows a comparatively lower value of MAP of for each 
w, and a slower increase with w for a repeat-based motif. In both cases MAPdf 
increases with increase in motif proportion c, though the increase is far more marked 
for the motif with equal base composition. 



2.3.1. Multiple motif MAP monotonic property 



In the progressive updating method that is used to include new motifs into the dic- 
tionary, a new motif is judged to be "significant" based on the increase in the MAP 
score after including the new motif. In this section we show that asymptotically, 
the MAP score monotonically increases with the size of the true dictionary, hence 
the MAP difference criterion theoretically tends to include "true" motifs. 

Let us define MAP^ as the MAP score corresponding to the "true" alignment 

for q word types (including d single letters and q ^ d motifs), where the q^^ word 
has k instances in the data set. If /3i > 1 Vi = 1, . . . , d, it is trivial to show that 



MAP^ < MAP^_^ \fl > 0, 
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which means that if no motifs of the new type q are present, the MAP score should 
theoretically show a decrease. Next, we derive conditions very similar to (2.1) which 
show that if the word type q is truly present, the MAP score should increase- 
specifically, we show that logA/^P^ — \ogMAPlj_i oo. 

Theorem 2.3. Let ^ Pi (for i = d + 1, . . . , q) where N = 

Et=iNu. Then MAP^yMAP^_^ ^ oo ( as m, N ^ oo) at an exponential rate 
in rN iff r > 0, where r is given by: 

d 

^ {{Pl - KiWPq+l)l0g{pi - K^WPq+l) - Pi logpj 
i=l 

Proofs of Theorems 2.1 and 2.3 are given in the Appendix. 
2.4- Performance of model selection criteria 

For empirical comparisons, we used four data sets: CRP (S. cerevisiae or yeast), 
GAL4 (yeast), a a [B. subtilis), and skeletal muscle TF MEF2 (human) (Wasscrman 
et al. [17]). The value of the MAP divergence factor {MAPdf) in Theorem 2.1 is 
shown in Figure 2, for motifs in the 4 data sets. The MAPof is calculated for 
motif composition frequency k = (/ci, . . . , k^) matching that of the consensus motif 
(Columns 5 to 8 in Table 1). For all motif occurrence proportions c, it appears 
that the MAPdf exceeds zero for the 3 data sets with longer motifs, viz. GAL4, 
CRP and MEF2- which gives an indication that the consensus motif is likely to be 
evaluated significant (under the conditions of Theorem 2.1). For a a (with lowest 
motif width of 6), the value of MAPof is slightly below zero for very small c, 
indicating that finding the true motif may be more difficult. 




0.00 0.05 0.10 0.15 



Fig 2. Values of MAPof against motif proportion c for 4 data sets: (1) CRP (2) GAL4 (3) 
2 motifs corresponding to ua binding sites in B. subtilis and (4) MEF2 in human sequences, w 
denotes the likely width of the motif based on experimental data, (cf^^ denotes the motif "TATAAT" 
and CT^' denotes "rTGACA".J 
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The MAP was also compared to several likelihood-based criteria in terms of 
model selection performance by using the knowledge of experimentally detected 
motifs for these data sets. If we set our stopping decision rule to be selection of 
motifs that keep the logMAP score positive, this contains the set of all true motifs, 
for every data set, with an increased power of discrimination with increasing motif 
length, which agrees with our theoretically obtained results. The performance of 
the BIC drastically worsens when the motif length increases- perhaps because of 
the higher penalty for the increased number of parameters in the model. 

An objection that may be raised on using the MAP for model selection is that 
it involves the evaluation of a "modal" value instead of a more typical Bayesian 
quantity involving a sample from the population. A sample-based statistic is often 
desirable in order to incorporate the posterior variability of the distribution into 
the measure used. There are two main reasons why we prefer the use of the MAP 
score. First, the analytical intractability of the Bayes factor forces us to choose an 
alternative approach- either computational or analytical. Second, it is difficult to 
get a "good" representative posterior sample from the distribution of interest i.e. 
p{A I S,&j, Ail), hence the lack of a "good" mean estimate. The contribution of 
the MAP to the BF increases as the likelihood surface becomes more spiky (hence 
resulting in the MAP score providing a better approximation) and simultaneously, 
the computational estimates for the BF arc likely to perform worse, as it is more 
difficult to obtain a true representative sample from the distribution. 



3. Sensitivity analysis for prior influence on the MAP criterion 

In a Bayesian analysis, it is of concern to check the robustness of posterior infer- 
ences to the prior specification. Even asymptotically, Bayes factors remain sensitive 
to the choice of prior, even though posterior means may not (Kass [8]). Conjugate 
priors are not automatically robust as the tails are typically of the same form as the 
likelihood function, and the prior remains influential when the likelihood function 
is concentrated in the prior tail (Berger [2]). With a large dimensional parameter 
space, we need to rely on the conjugate form of the prior for analytical and compu- 
tational tractability. However, it is still of interest to see the influence of the choice 
of prior parameters within this class on the posterior inferences, and develop an 
idea of situations leading to highest posterior robustness. For investigating poste- 
rior robustness over a class of priors, it is attractive to consider the e-contamination 
class (Berger [2]), defined by, F = {tt : tt = (1 — e)no + eq; q £ Q}, where ttq is the 
original prior and Q represents any set of distributions to ensure that F contains a 
plausible set of priors. Suppose the posterior distributions no{9\x) and q{9\x) exist 
(they do for a conjugate class Q), and TO(a;|7ro) = Jg f{x\9)Tro{6)d6 and m(a::|7r) are 
the marginals obtained by integrating out 9 with respect to the priors 7ro(-) and 
7r(-)). It is straightforward to see that 7r(6'|x) = Xq^e{x)no{9\x) + [1 — Xq^e{x)]q{9\x), 
where Xn e (x) = (i~<=)™(^Ko) ^ g^-^^^ follows that 

(3.1) E^i9\x) = Xq^,{x)E^„ {9\x) + [1 - A,,,(x)]-B,(0|x). 

However, posterior inference based on the MAP is affected, as the MAP involves 
the marginal likelihood instead of the conditional posterior. Let 

MAP (A*) ^-(•^ ' ^""^^ ' -4*^-^i'^Mg^)^g^ 
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If we are only interested in the effect of the prior 7r(©) for the motif column fre- 
quencies, the MAP for the contamination class F reduces to 

P^{S I A\Mi,0)T:{0)dO 
(3.2) = {l-e)MAP-,,{A*) + ^MAPq{A*). 



3. 1 . Sensitivity analysis for Dirichlet prior for motif composition 7 

Let us denote 6 — {5a, ■ ■ ■ ,St) as the pscudocount vector for the Dirichlet distri- 
bution q{'y). Also, let 5* = 6a + St- For simplicity, we assume that Sa = St = 
Sc ^ Sa = 1 — Then by varying < 6* < 1, we get a range of con- 
tamination distributions q. For e £ (0,1), we evaluate the robustness of prior ttq, 
through the variability in magnitude of the posterior criteria in (3.1) and (3.2) over 
A = {<5; < S* < 1}. Since we do not want the prior to dominate the data, we 
restrict the total pscudocount frequency X]j=i Sj = We assume a single motif of 
width w with observed frequency matrix C. Let djj = E{9ij \ C). Then, (3.1) gives 

% = Aq(x)— ^ ^ 



+ [l-\(z)]-^^^l±iL l<^<w■,l<J<i. 



1 



+ 7, + (Sj - 7,) < 1 + ^ n n 



l f=l Vr(Cy -f <5j) 



Comparative studies were done separately to test the sensitivity of the (1) Dirichlet 
prior parameters /3 for word frequency and (2) product Dirichlet prior parameters 
7 for the column frequencies, as 6 and are posterior independent. Nucleotide 
composition in biological sequences are often asymmetric, e.g. a relatively higher 
frequency of C/G is observed for higher organisms. With known motif PWMs we 
study the effect on posterior inference when we use priors with pscudocount frequen- 
cies reflecting (1) uniform letter frequency (2) data composition-dependent letter 
frequencies or (3) a mixture with components having different letter frequencies. 

Since we are mainly interested in whether the consensus motif sequence is af- 
fected by the choice of prior we study the variation of the highest frequency in 
each column 9* = maxi<j<4{0y }. Corresponding to each choice of S, we have a 
vector of maximal frequencies 6*^ = {Of , . . . ,e*J). Let Of = ■\t\T.s^f denote 
the mean frequency for column i (i = I, . . . ,w). To summarize this information 
over all columns we calculate three distance-based measures, for each (5 G A and 

e g (0, l)-(i) Mean squared distance: D^j = i J27=i \(^f " ^f] > (ii) KuUback- 
Leibler distance: Dj^ = J27=i ^t^ I'-'S fi^' ^^"-^ (^^0 Entropy or Information content: 

All three measures Dm, Dk and De exhibit little variability, only after the third 
decimal place (plots not shown), slightly increasing with the contamination rate e. 
However, the MAP scores, calculated for the equal, data-dependent, 3-component 
and 9-component mixture prior indicate that using the mixture prior distribution 
makes the MAP most robust to the choice of prior (Figure 3). So if the results are 



CRP 




Fig 3. Range in variability of MAP score (in log scale on vertical axis, may exclude an additive constant) at different levels of prior contamination, for data sets (1) 
CRP (2) GAL4 (3) Ujx (TATAAT motif) (4) MEF2 for human skeletal muscle data set. The four panels for each data set correspond to taking the initial prior (i) 
Dirichlet with symmetric base pseudocounts (ii) Dirichlet with pseudocounts proportional to frequency in data (Hi) 3-component mixture Dirichlet (iv) 9 component 
mixture Dirichlet. The horizontal axis on each plot represents the contamination proportion e. 
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Table 1 

Motif and background letter frequencies for the 4 data sets 



TF 




Background 






Motif 




KL 




A 


C 


G 


T 


A 


C 


G 


T 


distance 


CRP 


0.3021 


0.1825 


0.2090 


0.3063 


0.2841 


0.1799 


0.2140 


0.3220 


0.0011 


GAL4 


0.3116 


0.1914 


0.1761 


0.3208 


0.1218 


0.3908 


0.2983 


0.1891 


0.2218 


aA (TATAAT) 


0.3511 


0.1465 


0.1781 


0.3243 


0.4583 


0.0699 


0.0343 


0.4375 


0.1449 


MEF2 


0.2205 


0.2801 


0.2715 


0.2278 


0.6047 


0.0174 


0.0262 


0.3517 


0.6531 



to be evaluated using the MAP score at the mode, it seems to be safest to use a 
prior of Dirichlet mixtures instead of a single density. The most dramatic difference 
is observed in the GAL4 and MEF2 data on using a mixture Dirichlet prior- these 
are also the data sets having the lowest compositional similarity between the motif 
nucleotide frequency and background nucleotide frequency (KuUback-Leibler dis- 
tance in the last column of Table 1). It is probable that the usual DA and the Gibbs 
sampler tend to pick up motifs that are closer in composition to the background 
for a similar reason. Using a mixture Dirichlet prior may be more sensitive towards 
detecting motifs that arc vary compositionally from the background. 



3.2. Local sensitivity analysis 



One problem in doing a careful robustness study in a high-dimensional problem is 
the limitation to mainly one type of high-dimensional prior distribution (the con- 
jugate prior) for which analytical calculations are feasible. An alternative approach 
to judge sensitivity is to look for prior inputs that sharply change the posterior 
answers. Such local sensitivity analysis methods have been extensively developed 
and studied by Lcamer [11] and Polasek [14]. In our framework, we observe that 
the posterior means are locally insensitive. Let fij ~ E{pj \ S, A) 



Then, If 



r„„ < 1. A similar result holds for posterior means of 0. 

However, parameter specifications for both 7 and /3 may have a local impact on the 
MAP. Excluding constant terms, and using Stirling's approximation [1] to expand 
and simplify the F-functions, wc have 



SlogMAP 



E 



log 



+ 7i 



log 



Efc(cifc+7fe) 2 
Eklk 1 / 1 



7j 



2 VEfe7fe 



1 

7j 



Cy+7j Efc(c»fc+7fc) 



So the influence of "fj can be unbounded only for the second term- it increases with 
w and as the ratio of 'Ylik^jlk to 7^ increases. If Ea; 7*: ^ the second term is 

w {\og ^ -f 2^ ~ 5^ , which can be made arbitrarily large if 7^ is small. Under the 
restriction that 7^ = 1, we will get most robust results if the components of 7 
are approximately equal. 

We may also do a similar analysis for the local influence of the prior parameters 
for p, i.e. for /3. Again, we assume there is only 1 motif type under the motif model 
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(index is D). If we differentiate logMAP with respect to (3k, k < D, we get 



+ log 



Nik - Nok 



{N 



Ik 



-pk)iNak 



(3k) 



log 



D-l 



N 



Ef^r/3.- 

(3d 



(3.3) 



where Ni and TVq are the frequencies under the motif (A^i) and null [Mo) models. 



Let L denote the total length of the data. Then, L 



wNd, and Ef=i ^ij = L ~ Nd(w — 1). If (3d < J2f=i Pj^ t^*^ influential part is: 



Nd{w~1)^(3d 



Nd{w~1)~Pd) {L + Efjif3j) 



log 



NDiw~l)~(3D 



L 



Now, Nd{w — 1) < L. So if (3d < Ef=i Pj^ ^he influence of Pk is seen to be 

negligible for k < D. On the other hand, if Pd > Ef=i Pj^ then the influence of 
the third, fourth and fifth terms in (3.3) may be unbounded, but this is an unrealistic 
parametrization in practice, as the motif proportion is usually low compared to the 
size of the data. Finally, taking derivatives of logMAP with respect to Pd, we have, 



Pd)E 




D 



^ i{Ni-j^ 



-pj) 



The only influential term is the third, and the influence of Pd can be unbounded if 
Ef=i Pj Pd- So ideally, wc want Pd < Ef=i Pj^ but also that the difference 
is not extremely small. In other words, results are most robust to the choice of Pd 
when motif proportions are comparatively large. Having prior knowledge of the true 
motif frequency would make it easier to get more accurate results- by choosing Pd 
to give the expected motif frequency close to the prior knowledge. In practice, the 
choice of Pd is seen to have a great impact on motif detection and evaluation, and 
this remains one of the greatest stumbling blocks of this probabilistic framework. 



4. Discussion and conclusions 

A MAP criterion is proposed for model selection in the motif discovery problem. 
Analytical and computational investigations establish conditions for the the MAP 
criterion to asymptotically predict the correct number of motifs and attaining these 
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conditions is seen to be feasible in a real scenario. We investigate the sensitivity of 
the MAP criterion to prior specification and provide guidelines for choosing these 
hyper-parameters to maximize robustness. The MAP is seen to perform well as a 
model selection criterion in preliminary studies, paving the way for further analysis 
of its performance when adapted to more complex realistic models. 

Appendix A: Outline of proof for Theorem 2.1 

From expression (2.3), we have 



MAP{A^) 



nti(^o, + ai-l)^'«+"'-^ 



11 / ^ ^ X /c(a,/3,7). 



where 



Denote X^^^i = iVi,d+i = EiLi = EiLi iVi^ + ^ YL'ltl 

N ~ (w ~ l)m. Then, 



MAP{J^) « fc(a,/3,7)xl| 



»=1 ^0* 




(A.l) xH 



I Ei^i A - 1 - (w - 1)to j 
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Let the symbol = denote "equal in order of convergence" , i.e. ignoring terms that 
converge to a constant in the limit, as A'^ — > c», Noi — > oo, Nu — > oo. Also, as 
m oo, ^ — > c . Then, taking logarithms, expression (A.l) 

d d 



fd+1 \ 

^A-- log[l-c(«;-l)] 



\j=l 

w{d-l) 



2 

logfl 



d d+l 

(w - i)m + - 



E 



logiV, 



Oi 



i=l 1=1 

w{d - 1) + 1 



logiV 



logm 



_ 1 _ [1 _ c(w - l)]7Vlog[l - c(w - 1) 

c»3 + 7j - 1 



(A-2) +EEf^»^ +^J~ Jllogf- 



since 



AT 



1 - 



[1 - c{w - l)f 



(same for limit in m). Under the conditions of the statement, Noi — > oo, A'^i^ 

^°'m^^' '^i^' "TT ^oi- Now, assume that the motif is "exact", i.e. = 
m for some j, every z. Now let us denote 7+ = maxi<j<d"/j , 7^ = minK^xd 7j • 
Then, the last term of (A. 2) is greater than 

(A. 3) w ^TO + 7^ — log(7Ti + 7^ — 1) ^ Nwclogc + w ^cN + 7~ — logiV. 

Taking the limits of all non-constant terms in expression (A. 2), and simplifying, we 
then have 

logM^P(A") 

■ d 

^{Ooi - hew) \og{9oi - hew) 

i=l 

d 

- J2 log % + c log c - [1 - c(w - 1)] log[l ~e{w-l)] N 



(A.4) 



i=l 

d ^ 

w-f- -w'^jj-- 



logA^. 



So a sufficient condition for MAP{A^) — > 00 at an exponential rate is 



(A.5) 



;logc + ^(6*0; - hew) log(6'oi - hew) 
1=1 
d 

- ^ 9o, \ogOo^ - [1 - eiw ~ 1)] log[l ~ e{w - 1)] > 0. 
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Now we will show that (A. 5) actually is a necessary condition. Assume that condi- 
tion (A. 5) is not satisfied. If the first term in (A. 4) is < 0, then log MAP{A^) 
— cx). If the first term is zero, using (A.3), we see that, 



(A.6) 



log MAP{A") > 
log MAP {A") < 



d 



wj — w 



log N, and 
logA^. 



In (A.6), the coefficient of log A^ is negative. Thus, as A^ — > oo, MAP{A'^) is bounded 
by 0. So condition (A. 5) is also a necessary condition for MAP{A'^) — > oo. 



Appendix B: Outline of proof for Theorem 2.2 



Using the Lagrangian method with the restrictions {J2i=i ^oi = 1) and {J2i=i = 
1), we get O^i = ki = 2 3i,s the optimal solution. To verify that this is the maximum 
value, let 



F{eo, k) = - hew) log(6'oi - hew) - ^ 6*01 log 6*01. 



Then. 



> 0, 



dk'f Ooi — kiCW 

The determinant of the Hessian matrix is 



> 0, and 



< 0. 



dOoidki Ooi — ki cw 

where 



d^F 




A 


B 


aaak 




B 


K 



A = diag{A,)- {A, = f^f ), B = diag{B,); {B, = q§^) and K = diag{K,)- 
{K, ^ ^). Hence \\H\\ = \A\\K - BA-^B\ = \A\\diagiMi)l where 



M, 



9 9 
e w 



9oi - hew {Oqi - hcw)'^ 



(eo.-A 



kicw Ooi 



CW 



, . - I < 0. 

yoi " hew \ h 



Hence, is negative definite, and the expression in (2.5) corresponds to the 

maximum attained by F. 



Appendix C: Outline of proof for Theorem 2.3 



Let n denote the sum of word frequencies under model A^^, i.e. n = X]i=i ^u- Let 
m be the number of word instances of type q+\ and I be the corresponding number 
of type q, q > d where the alphabet size is d as previously. Ideally, if model Mq+i 
is true, when n — *■ 00, with Pq+i^ logMAP^+i — logMAPq 00; while 
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log AI APq+i — \ogMAPq — * (or is bounded above) when = 0. By definition, 



logMAP^i-logAMP^ 

nSr(iV2, + ft)r(E?=iV, 



log 

- log 
+ log 



" nLir(^i. + A)rg:L,A) 
.r(ELi[^H + A])nLim) 
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(C.l) 



If log 



n,tir(7,) 



, (where fc = q + 1 — d) . 



If 



N 



0, then in (C.l), ^ - ^ ^ for i < g. Then (C.l) becomes 



log[r(E-;i A)] - log[r(ELi A)] - logr(/3,+i) = log Beta(ELi a suf- 

ficient condition for logAf AP^+i - log MAP, < is > 1, ELi A > 1- ^ow 
we derive the monotonicity property of the MAP score, i.e. when Pq+ij un- 

der model Aiq+i, logMAP^+i — logMAPg — > oo. Leaving out the constant terms, 
(C.l) reduces to 



log 



nti nN2^ + A)r {N2,q+i + Pq+i) T (EUiWu + A]) 



5^ log 



n^=ir(c»jfc + 7j) 



Let us denote m = N2.q+i = E^i^fe^* ^ 1' 



Then ELi + 



Ei=i -^ij = Using Stirling's approximation, (C.l) reduces to 
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m + Pq+l 
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^ ' log (m + - 1) 
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log (TVh + a - 1) 
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since, for i = cZ + 1, . . . , g, we have Nu = N2i. Leaving out constants and terms 
tending towards as Nu — > oo, N2i — *■ oo, we can write (C.2) as 



E 



N2^ + A - log {N2i) - (nu + P^-\]\og {Nu) 



+ \rn + Pq+i - i j log (m + - 1) 



9+1 ^ 

(u.-l)m + ^ft-- 
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log 
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+ [m{w - 1) - /3g+i] log(n) 



■E 



^ ( c^jfe + 7j - ^ ) log (cjjfc + 7j - 1) 
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m 



log I TO + ^ 7j - 1 



Now we evaluate each of the terms as n ^ cxd, to ^ oo, with ^ — > Pq+i- Assume 
that the proportion contributed to motif (q + 1) by each base i, « = 1, . . . , d is k^, 

such that J—^ WKi, i = 1, . . . , d, J2i=i = 1 and, as before, n — raw = 

^1^1 N2i. Also, assume that all instances of the motif arc exact, i.e. Cijk = to for 
some j e 1, . . . , c? Vi = 1, . . . , w. Also, for all let 7(1) = 7j if Cijk = to. So, 
finally, simplifying terms in (C.3), we get. 



{{Pl - KiWPq+l)\0g{pi " KiWPq+l) - logpi} 

.1=1 

+ [1 - (W - l)Pq+l] log[l - (W - + logPq+1 



^7(i) - w^7j 

i=i 



i=l 



(C.4) +nlogn[-i(; + 1 + - l]/9g+i + logn 

Let r denote 

d 

^ {(Pi - log(pi - KiWPg+l) - Pi log Pi} 

1=1 

+P<J+1 logPg+1 + [1 - {-^ - l)Pg+l] l0g[l - (w - l)Pg+l]. 

Then (C.4) — > 00 at a rate exponential in iff r > 0. If r = 0, the coefficient 

of logn, Er=i7(*) - wY.]=ili < so logMAFg+i - logA/AP, ^ 0. Also, 
by a similar argument as in Theorem 2.2, the maximum value is attained when 
Ki = Pi = i, for i = 1, . . . , d. 
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